DROPS

Document

DOI: 10.4230/LIPIcs.DISC.2023.17

Modular Recoverable Mutual Exclusion Under System-Wide Failures

Authors: Sahil Dhoked, Wojciech Golab, and Neeraj Mittal

Published in: LIPIcs, Volume 281, 37th International Symposium on Distributed Computing (DISC 2023)

Abstract

Recoverable mutual exclusion (RME) is a fault-tolerant variation of Dijkstra’s classic mutual exclusion (ME) problem that allows processes to fail by crashing as long as they recover eventually. A growing body of literature on this topic, starting with the problem formulation by Golab and Ramaraju (PODC'16), examines the cost of solving the RME problem, which is quantified by counting the expensive shared memory operations called remote memory references (RMRs), under a variety of conditions. Published results show that the RMR complexity of RME algorithms, among other factors, depends crucially on the failure model used: individual process versus system-wide. Recent work by Golab and Hendler (PODC'18) also suggests that explicit failure detection can be helpful in attaining constant RMR solutions to the RME problem in the system-wide failure model. Follow-up work by Jayanti, Jayanti, and Joshi (SPAA'23) shows that such a solution exists even without employing a failure detector, albeit this solution uses a more complex algorithmic approach. In this work, we dive deeper into the study of RMR-optimal RME algorithms for the system-wide failure model, and present contributions along multiple directions. First, we introduce the notion of withdrawing from a lock acquisition rather than resetting the lock. We use this notion to design a withdrawable RME algorithm with optimal O(1) RMR complexity for both cache-coherent (CC) and distributed shared memory (DSM) models in a modular way without using an explicit failure detector. In some sense, our technique marries the simplicity of Golab and Hendler’s algorithm with Jayanti, Jayanti and Joshi’s weaker system model. Second, we present a variation of our algorithm that supports fully dynamic process participation (i.e., both joining and leaving) in the CC model, while maintaining its constant RMR complexity. We show experimentally that our algorithm is substantially faster than Jayanti, Jayanti, and Joshi’s algorithm despite having stronger correctness properties. Finally, we establish an impossibility result for fully dynamic RME algorithms with bounded RMR complexity in the DSM model that are adaptive with respect to space, and provide a wait-free withdraw section.

Cite as

Sahil Dhoked, Wojciech Golab, and Neeraj Mittal. Modular Recoverable Mutual Exclusion Under System-Wide Failures. In 37th International Symposium on Distributed Computing (DISC 2023). Leibniz International Proceedings in Informatics (LIPIcs), Volume 281, pp. 17:1-17:24, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2023)

Copy BibTex To Clipboard

@InProceedings{dhoked_et_al:LIPIcs.DISC.2023.17,
  author =	{Dhoked, Sahil and Golab, Wojciech and Mittal, Neeraj},
  title =	{{Modular Recoverable Mutual Exclusion Under System-Wide Failures}},
  booktitle =	{37th International Symposium on Distributed Computing (DISC 2023)},
  pages =	{17:1--17:24},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-301-0},
  ISSN =	{1868-8969},
  year =	{2023},
  volume =	{281},
  editor =	{Oshman, Rotem},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/LIPIcs.DISC.2023.17},
  URN =		{urn:nbn:de:0030-drops-191431},
  doi =		{10.4230/LIPIcs.DISC.2023.17},
  annote =	{Keywords: mutual exclusion, shared memory, persistent memory, fault tolerance, system-wide failure, RMR complexity, dynamic joining, dynamic leaving}
}

Document

Brief Announcement

DOI: 10.4230/LIPIcs.DISC.2023.38

Brief Announcement: On Implementing Wear Leveling in Persistent Synchronization Structures

Authors: Jakeb Chouinard, Kush Kansara, Xialin Liu, Nihal Potdar, and Wojciech Golab

Published in: LIPIcs, Volume 281, 37th International Symposium on Distributed Computing (DISC 2023)

Abstract

The last decade has witnessed an explosion of research on persistent memory, which combines the low access latency of dynamic random access memory (DRAM) with the durability of secondary storage. Intel’s implementation of persistent memory, called Optane, comes close to realizing the game-changing potential of persistent memory in terms of performance; however, it also suffers from limited endurance and relies on a proprietary wear leveling mechanism to mitigate memory cell wear-out. The traditional embedded approach to wear leveling, in which the storage device itself maps logical addresses to physical addresses, can be fast and energy-efficient, but it is also relatively inflexible and can lead to missed opportunities for optimization. An alternative school of thought, exemplified by "open channel" solid state drives (SSDs), delegates responsibility for wear leveling to software, where it can be tailored to specific applications. In this research, we consider a hypothetical hardware platform where the same paradigm is applied to the persistent memory device, and ask how the wear leveling mechanism can be co-designed with synchronization structures that generate highly skewed memory access patterns. Building on the recent work of Liu and Golab, we implement an improved wear leveling atomic counter by leveraging hardware transactional memory in a novel way. Our solution is close to optimal with respect to both space complexity and measured performance.

Cite as

Jakeb Chouinard, Kush Kansara, Xialin Liu, Nihal Potdar, and Wojciech Golab. Brief Announcement: On Implementing Wear Leveling in Persistent Synchronization Structures. In 37th International Symposium on Distributed Computing (DISC 2023). Leibniz International Proceedings in Informatics (LIPIcs), Volume 281, pp. 38:1-38:7, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2023)

Copy BibTex To Clipboard

@InProceedings{chouinard_et_al:LIPIcs.DISC.2023.38,
  author =	{Chouinard, Jakeb and Kansara, Kush and Liu, Xialin and Potdar, Nihal and Golab, Wojciech},
  title =	{{Brief Announcement: On Implementing Wear Leveling in Persistent Synchronization Structures}},
  booktitle =	{37th International Symposium on Distributed Computing (DISC 2023)},
  pages =	{38:1--38:7},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-301-0},
  ISSN =	{1868-8969},
  year =	{2023},
  volume =	{281},
  editor =	{Oshman, Rotem},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/LIPIcs.DISC.2023.38},
  URN =		{urn:nbn:de:0030-drops-191640},
  doi =		{10.4230/LIPIcs.DISC.2023.38},
  annote =	{Keywords: persistent memory, transactional memory, wear leveling, atomic counter, concurrency, fault tolerance, theory}
}

Document

DOI: 10.4230/LIPIcs.DISC.2021.29

Detectable Sequential Specifications for Recoverable Shared Objects

Authors: Nan Li and Wojciech Golab

Published in: LIPIcs, Volume 209, 35th International Symposium on Distributed Computing (DISC 2021)

Abstract

The recent commercial release of persistent main memory by Intel has sparked intense interest in recoverable concurrent objects. Such objects maintain state in persistent memory, and can be recovered directly following a system-wide crash failure, as opposed to being painstakingly rebuilt using recovery state saved in slower secondary storage. Specifying and implementing recoverable objects is technically challenging on current generation hardware precisely because the top layers of the memory hierarchy (CPU registers and cache) remain volatile, which causes application threads to lose critical execution state during a failure. For example, a thread that completes an operation on a shared object and then crashes may have difficulty determining whether this operation took effect, and if so, what response it returned. Friedman, Herlihy, Marathe, and Petrank (DISC'17) recently proposed that this difficulty can be alleviated by making the recoverable objects detectable, meaning that during recovery, they can resolve the status of an operation that was interrupted by a failure. In this paper, we formalize this important concept using a detectable sequential specification (DSS), which augments an object’s interface with auxiliary methods that threads use to first declare their need for detectability, and then perform detection if needed after a failure. Our contribution is closely related to the nesting-safe recoverable linearizability (NRL) framework of Attiya, Ben-Baruch, and Hendler (PODC'18), which follows an orthogonal approach based on ordinary sequential specifications combined with a novel correctness condition. Compared to NRL, our DSS-based approach is more portable across different models of distributed computation, compatible with several existing linearizability-like correctness conditions, less reliant on assumptions regarding the system, and more flexible in the sense that it allows applications to request detectability on demand. On the other hand, application code assumes full responsibility for nesting DSS-based objects. As a proof of concept, we demonstrate the DSS in action by presenting a detectable recoverable lock-free queue algorithm and evaluating its performance on a multiprocessor equipped with Intel Optane persistent memory.

Cite as

Nan Li and Wojciech Golab. Detectable Sequential Specifications for Recoverable Shared Objects. In 35th International Symposium on Distributed Computing (DISC 2021). Leibniz International Proceedings in Informatics (LIPIcs), Volume 209, pp. 29:1-29:19, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)

Copy BibTex To Clipboard

@InProceedings{li_et_al:LIPIcs.DISC.2021.29,
  author =	{Li, Nan and Golab, Wojciech},
  title =	{{Detectable Sequential Specifications for Recoverable Shared Objects}},
  booktitle =	{35th International Symposium on Distributed Computing (DISC 2021)},
  pages =	{29:1--29:19},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-210-5},
  ISSN =	{1868-8969},
  year =	{2021},
  volume =	{209},
  editor =	{Gilbert, Seth},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/LIPIcs.DISC.2021.29},
  URN =		{urn:nbn:de:0030-drops-148311},
  doi =		{10.4230/LIPIcs.DISC.2021.29},
  annote =	{Keywords: persistent memory, concurrency, fault tolerance, correctness, detectability}
}

Document

DOI: 10.4230/LIPIcs.OPODIS.2019.19

Toward Linearizability Testing for Multi-Word Persistent Synchronization Primitives

Authors: Diego Cepeda, Sakib Chowdhury, Nan Li, Raphael Lopez, Xinzhe Wang, and Wojciech Golab

Published in: LIPIcs, Volume 153, 23rd International Conference on Principles of Distributed Systems (OPODIS 2019)

Abstract

Persistent memory makes it possible to recover in-memory data structures following a failure instead of rebuilding them from state saved in slow secondary storage. Implementing such recoverable data structures correctly is challenging as their underlying algorithms must deal with both parallelism and failures, which makes them especially susceptible to programming errors. Traditional proofs of correctness should therefore be combined with other methods, such as model checking or software testing, to minimize the likelihood of uncaught defects. This research focuses specifically on the algorithmic principles of software testing, particularly linearizability analysis, for multi-word persistent synchronization primitives such as conditional swap operations. We describe an efficient decision procedure for linearizability in this context, and discuss its practical applications in detecting previously-unknown bugs in implementations of multi-word persistent primitives.

Cite as

Diego Cepeda, Sakib Chowdhury, Nan Li, Raphael Lopez, Xinzhe Wang, and Wojciech Golab. Toward Linearizability Testing for Multi-Word Persistent Synchronization Primitives. In 23rd International Conference on Principles of Distributed Systems (OPODIS 2019). Leibniz International Proceedings in Informatics (LIPIcs), Volume 153, pp. 19:1-19:17, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2020)

Copy BibTex To Clipboard

@InProceedings{cepeda_et_al:LIPIcs.OPODIS.2019.19,
  author =	{Cepeda, Diego and Chowdhury, Sakib and Li, Nan and Lopez, Raphael and Wang, Xinzhe and Golab, Wojciech},
  title =	{{Toward Linearizability Testing for Multi-Word Persistent Synchronization Primitives}},
  booktitle =	{23rd International Conference on Principles of Distributed Systems (OPODIS 2019)},
  pages =	{19:1--19:17},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-133-7},
  ISSN =	{1868-8969},
  year =	{2020},
  volume =	{153},
  editor =	{Felber, Pascal and Friedman, Roy and Gilbert, Seth and Miller, Avery},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/LIPIcs.OPODIS.2019.19},
  URN =		{urn:nbn:de:0030-drops-118050},
  doi =		{10.4230/LIPIcs.OPODIS.2019.19},
  annote =	{Keywords: Shared memory, persistent memory, synchronization, multi-word primitives, concurrency, correctness, software testing}
}

@InProceedings{cepeda_et_al:LIPIcs.OPODIS.2019.19,
  author =	{Cepeda, Diego and Chowdhury, Sakib and Li, Nan and Lopez, Raphael and Wang, Xinzhe and Golab, Wojciech},
  title =	{{Toward Linearizability Testing for Multi-Word Persistent Synchronization Primitives}},
  booktitle =	{23rd International Conference on Principles of Distributed Systems (OPODIS 2019)},
  pages =	{19:1--19:17},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-133-7},
  ISSN =	{1868-8969},
  year =	{2020},
  volume =	{153},
  editor =	{Felber, Pascal and Friedman, Roy and Gilbert, Seth and Miller, Avery},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/LIPIcs.OPODIS.2019.19},
  URN =		{urn:nbn:de:0030-drops-118050},
  doi =		{10.4230/LIPIcs.OPODIS.2019.19},
  annote =	{Keywords: Shared memory, persistent memory, synchronization, multi-word primitives, concurrency, correctness, software testing}
}

Document

DOI: 10.4230/LIPIcs.OPODIS.2015.20

Robust Shared Objects for Non-Volatile Main Memory

Authors: Ryan Berryhill, Wojciech Golab, and Mahesh Tripunitara

Published in: LIPIcs, Volume 46, 19th International Conference on Principles of Distributed Systems (OPODIS 2015)

Abstract

Research in concurrent in-memory data structures has focused almost exclusively on models where processes are either reliable, or may fail by crashing permanently. The case where processes may recover from failures has received little attention because recovery from conventional volatile memory is impossible in the event of a system crash, during which both the state of main memory and the private states of processes are lost. Future hardware architectures are likely to include various forms of non-volatile random access memory (NVRAM), creating new opportunities to design robust main memory data structures that can recover from system crashes. In this paper we advance the theoretical foundations of such data structures in two ways. First, we review several known variations of Herlihy and Wing's linearizability property that were proposed in the context of message passing systems but also apply in our NVRAM-based model, we discuss the limitations of these properties with respect to our specific goals, and we propose an alternative correctness condition called recoverable linearizability. Second, we discuss techniques for implementing shared objects that satisfy such properties with a focus on wait-free implementations. Specifically, we demonstrate how to achieve different variations of linearizability in our model by transforming two classic wait-free constructions.

Cite as

Ryan Berryhill, Wojciech Golab, and Mahesh Tripunitara. Robust Shared Objects for Non-Volatile Main Memory. In 19th International Conference on Principles of Distributed Systems (OPODIS 2015). Leibniz International Proceedings in Informatics (LIPIcs), Volume 46, pp. 20:1-20:17, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2016)

Copy BibTex To Clipboard

@InProceedings{berryhill_et_al:LIPIcs.OPODIS.2015.20,
  author =	{Berryhill, Ryan and Golab, Wojciech and Tripunitara, Mahesh},
  title =	{{Robust Shared Objects for Non-Volatile Main Memory}},
  booktitle =	{19th International Conference on Principles of Distributed Systems (OPODIS 2015)},
  pages =	{20:1--20:17},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-939897-98-9},
  ISSN =	{1868-8969},
  year =	{2016},
  volume =	{46},
  editor =	{Anceaume, Emmanuelle and Cachin, Christian and Potop-Butucaru, Maria},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/LIPIcs.OPODIS.2015.20},
  URN =		{urn:nbn:de:0030-drops-66116},
  doi =		{10.4230/LIPIcs.OPODIS.2015.20},
  annote =	{Keywords: non-volatile main memory, concurrency, recovery, data structures}
}

Search Results

Documents authored by Golab, Wojciech

Modular Recoverable Mutual Exclusion Under System-Wide Failures

Abstract

Cite as

Brief Announcement: On Implementing Wear Leveling in Persistent Synchronization Structures

Abstract

Cite as

Detectable Sequential Specifications for Recoverable Shared Objects

Abstract

Cite as

Toward Linearizability Testing for Multi-Word Persistent Synchronization Primitives

Abstract

Cite as

Robust Shared Objects for Non-Volatile Main Memory

Abstract

Cite as

Thanks for your feedback!

Could not send message